01_water_quality_data
Read in raw data
The raw data is from Water Quality Data in NYC.
EDA and Data Filtering
All variables:
## [1] "Sample.Number" "Sample.Date"
## [3] "Sample.Time" "Sample.Site"
## [5] "Sample.class" "Residual.Free.Chlorine..mg.L."
## [7] "Turbidity..NTU." "Fluoride..mg.L."
## [9] "Coliform..Quanti.Tray...MPN..100mL." "E.coli.Quanti.Tray...MPN.100mL."
The water quality data contain variables as below (Since we don’t have detailed explanation from the website, all explanation is based on my own understanding):
Sample level:
- Sample Number: A unique identifier assigned to each water sample.
- Sample Date: The date when the water sample was collected. It is usually formatted as MM/DD/YYYY.
- Sample Time: The time when the water sample was taken (timestamp format (ISO 8601)), providing precise timing. Here we don’t need such precise timing.
- Sample Site: The identifier that indicates the location where the
sample was collected. We can map this site to
sample_siteto get the accurate coordinate. - Sample Class: The classification of the sample. For example, “Compliance” indicates that these samples are collected as part of the regulatory monitoring program, while “Operational” help operators manage and maintain system performance, optimize treatment processes, and monitor system conditions. Here we don’t need such precise classification.
Water quality measurement:
- Residual Free Chlorine (mg/L): Free chlorine is used to maintain a disinfectant residual throughout the distribution system.
- Turbidity (NTU): Turbidity is a measure of water clarity. It indicates the amount of suspended particles in the water, which can impact both the appearance and the safety of the water.
- Fluoride (mg/L): Fluoride is often added to public water supplies to help prevent tooth decay.
- Coliform (Quanti-Tray) (MPN/100mL): Coliform bacteria are used as indicators of general water quality and sanitary conditions.
- E.coli (Quanti-Tray) (MPN/100mL): Similar to the coliform count. E. coli is a specific fecal indicator; its presence typically signals fecal contamination, which is a significant health concern.
sample date
water_quality <- water_quality %>%
mutate(date = as.Date(Sample.Date, format = "%m/%d/%Y")) %>%
mutate(year_month = format(date, "%Y-%m")) %>%
mutate(year = format(date, "%Y")) %>%
mutate(month = format(date, "%m")) water_quality %>%
group_by(year, month) %>%
summarise(count = n()) %>%
ungroup() %>%
ggplot(aes(x = month, y = year, fill = count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Heatmap of Year-Month Combination Count", x = "Month", y = "Year") +
theme_minimal()We can find almost all the time we have the measurement, making it possible to do a time scale analysis.
sample site
We first do the deduplication of the reference sample site file. (Sample Site 39550)
names(sample_site)[c(1,3,4)] = c("Sample.Site","X","Y")
sample_site <- sample_site %>%
distinct(Sample.Site, .keep_all = TRUE)We first map the provided coordinates to the data and remove the NA values.
water_quality <- water_quality %>%
left_join(sample_site %>%
select("Sample.Site", "X", "Y"),
by = "Sample.Site")
water_quality <- water_quality %>%
filter(!is.na(X) & !is.na(Y))The provided sample coordinate seems to be encoded in
EPSG:2263 format. We first try to change this into the
latitude and longitude based encoding
WGS84(EPSG:4326) to make our visualization
easier.
water_quality <- st_as_sf(water_quality, coords = c("X", "Y"), crs = 2263)
water_quality$geometry <- st_transform(water_quality$geometry, 4326)From the changed geometry, we can find they are roughly consistent with the latitude and longitude of NYC, so we would expect the transformation is working.
## Geometry set for 151215 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -74.24467 ymin: 40.50231 xmax: -73.71687 ymax: 40.90798
## Geodetic CRS: WGS 84
## First 5 geometries:
We can try to visualize it on the map. Here we get the NYC neighborhood data from official website. The visualization also shows the coordinate mapping is satisfactory.
## Reading layer `nynta2010' from data source
## `/Users/dl3738/Documents/Course/Data Visualization/Group_F_TBD/data/raw_data/nynta2010_25a/nynta2010.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
## Simple feature collection with 195 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS: WGS 84
## First 10 features:
## BoroCode BoroName CountyFIPS NTACode NTAName
## 1 4 Queens 081 QN08 St. Albans
## 2 2 Bronx 005 BX41 Mount Hope
## 3 4 Queens 081 QN38 Pomonok-Flushing Heights-Hillcrest
## 4 4 Queens 081 QN52 East Flushing
## 5 3 Brooklyn 047 BK44 Madison
## 6 4 Queens 081 QN07 Hollis
## 7 3 Brooklyn 047 BK41 Kensington-Ocean Parkway
## 8 4 Queens 081 QN33 Cambria Heights
## 9 4 Queens 081 QN44 Glen Oaks-Floral Park-New Hyde Park
## 10 4 Queens 081 QN62 Queensboro Hill
## Shape_Leng Shape_Area geometry
## 1 45401.32 77412748 MULTIPOLYGON (((-73.75205 4...
## 2 18937.25 14716711 MULTIPOLYGON (((-73.89561 4...
## 3 30731.42 38835920 MULTIPOLYGON (((-73.7964 40...
## 4 25848.55 29453684 MULTIPOLYGON (((-73.79493 4...
## 5 26237.26 27379162 MULTIPOLYGON (((-73.93754 4...
## 6 20976.34 22887773 MULTIPOLYGON (((-73.75726 4...
## 7 20800.75 15893334 MULTIPOLYGON (((-73.97084 4...
## 8 26168.46 33076727 MULTIPOLYGON (((-73.72678 4...
## 9 33597.01 45658769 MULTIPOLYGON (((-73.7105 40...
## 10 30690.30 26541931 MULTIPOLYGON (((-73.81537 4...
Chlorine
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.9900 0.4200 0.5700 0.5624 0.7100 2.2000 49
Turbidity
Extreme value: <0.10
Here for easy analysis, we directly change these values to 0.10.
water_quality <- water_quality %>%
mutate(Turbidity=gsub("[<>]", "", Turbidity)) %>%
mutate(Turbidity=as.numeric(Turbidity)) %>%
filter(!is.na(Turbidity) & Turbidity >= 0)## [1] 332
## [1] 0.002196334
water_quality %>%
filter(Turbidity <= 1.5) %>%
ggplot(aes(x=Turbidity)) +
geom_histogram(binwidth = 0.1)Fluoride
## [1] 19961
Since the NA value for Fluoride is too much and this variable is not very correlated with water quality, we can just discard it.
Coliform
Extreme value: <1, >200.5
water_quality <- water_quality %>%
mutate(Coliform=gsub("[<>]", "", Coliform)) %>%
mutate(Coliform=as.numeric(Coliform)) %>%
filter(!is.na(Coliform) & Coliform >= 0)## [1] 416
## [1] 0.002753253
Ecoli
Extreme value: <1
water_quality <- water_quality %>%
mutate(Ecoli=gsub("[<>]", "", Ecoli)) %>%
mutate(Ecoli=as.numeric(Ecoli)) %>%
filter(!is.na(Ecoli) & Ecoli >= 0)## [1] 1
## [1] 6.618396e-06
Variant selection
coordinates <- st_coordinates(water_quality)
water_quality <- water_quality %>%
mutate(longitude = coordinates[,1]) %>%
mutate(latitude = coordinates[,2])
water_quality_to_save <- water_quality %>%
select(Sample.Number, year_month, year, month, Residual_Chlorine, Turbidity, longitude, latitude, Neighbourhood) %>%
st_drop_geometry()Sample plot
Geometric mapping
We can change the time scale to see different patterns.
plot_data <- water_quality %>%
filter(as.character(year_month) == time) %>%
select(all_of(c(variant, "geometry"))) %>%
group_by(geometry) %>%
summarise(mean_value = mean(.data[[variant]], na.rm = TRUE))
pal <- colorNumeric(palette = "YlOrRd", domain = plot_data[["mean_value"]])
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data = neighborhoods, color = "#444", weight = 1, label = ~NTAName) %>%
addCircleMarkers(
data = plot_data,
radius = 5,
color = pal(plot_data[["mean_value"]]),
fillOpacity = 0.8,
stroke = FALSE,
popup = ~paste(variant, ": ", mean_value)
) %>%
addLegend(
pal = pal,
values = plot_data[["mean_value"]],
title = variant,
position = "bottomright"
)Trend along time scale
time_start <- ym("2023-01")
time_end <- ym("2024-12")
variant <- "Residual_Chlorine"
neighbourhood <- "Morningside Heights"sub_neighbourhood <- neighborhoods %>%
filter(NTAName == neighbourhood)
plot_data <- water_quality %>%
filter(ym(year_month) >= time_start & ym(year_month) <= time_end) %>%
select(all_of(c(variant, "geometry", "year_month"))) %>%
group_by(geometry, year_month) %>%
summarise(mean_value = mean(.data[[variant]], na.rm = TRUE)) %>%
ungroup()
if (neighbourhood != "Whole NYC")
plot_data <- st_filter(plot_data, sub_neighbourhood)
plot_data %>%
ggplot(aes(x=year_month, y=mean_value, group=1)) +
geom_smooth(method = "loess", se=TRUE) +
labs(
title = paste(variant, "change along time in", neighbourhood),
x = "Time",
y = variant
) +
theme(
plot.title = element_text(face = "bold", size = 13),
axis.text.y = element_text(face = "bold"),
axis.text.x = element_text(face = "bold", angle = 45, hjust = 1),
axis.title.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
panel.grid = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
axis.line = element_line(color = "black")
)